This notebook contains the analysis of the data.
The notebook consists of four sections. In the first section, all the necessary preparatory activities are carried out. The second section analyzes which regions have good long-distance transport connections. Then, the stations are analyzed in the context of their centrality and long-distance accessibility. In the last section, the characteristics of long-distance trains are analyzed in detail.
Note: All database queries and most data transformations are performed with the cypher language (query language of neo4j databases)
In this chapter, all preparatory activities will be performed.
First, we need to load all the required packages. Besides the public packages (e.g.plotly and pandas) we import the custom package reportModule. This package contains a custom class for connecting to a neo4j database and querying data from it. We copied the code for creating this class from the website towards datascience and made minor changes for our project.
import json
from neo4j import GraphDatabase
import pandas as pd
import numpy as np
# plotly imports
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
# custom imports
import reportModule as rm
Next, a geojson file with the borders of the German counties is loaded. This data is needed for plotting maps of Germany.
# read geojson file
with open('data/georef-germany-kreis.geojson') as f:
geodata = json.load(f)
The last preparatory activity is to establish a connection to the neo4j database by creating a new instance of the neo4j connection class. During instantiation, all authentication data is set.
conn = rm.Neo4jConnection(uri="bolt://localhost:7687",
user="neo4j",
pwd="admin",
db="deutschebahn")
This chapter analyzes which regions have good long-distance transportation connections and the characteristics of these regions.
First, the required data is retrieved from the database and stored in a Pandas data frame. The following characteristics are queried per county:
The last two characteristics contain information about the supply of long-distance trains in a county. The number of trains variable contains the number of trains that pass through a county. Each train is counted only once, regardless of how many stops the train makes in that county. In contrast, the number of train stops variable takes into account how many stations in a county the train stops at.
Note: The data must be queried in two database queries and then linked together. The first queries the socio-economic data and the number of trains per country. The second queries the number of train stops per country.
# query socioeconomic data and number of trains per county
query = """MATCH (c:County)-[]->(st:State)
OPTIONAL MATCH (c)<-[]-(s:Station)-[a:TRAIN]-()
RETURN st.name AS stateName,
st.region AS region,
c.name AS countyName,
c.countyId AS countyId,
c.areaSize AS countySize,
c.population AS countyPop,
c.population/c.areaSize AS countyPopDensity,
c.unemploymentRate AS countyUnemp,
c.gdpPerCapita AS countyGdpPerCap,
c.gdp AS countyGdp,
COUNT(DISTINCT s.stationKey) AS numberOfStations,
COUNT(DISTINCT a.trainNumber) AS numberOfTrains"""
county_baseData = conn.query(query, returnDataFrame=True)
# query number of train stops per county
query = """MATCH (c:County)
OPTIONAL MATCH (c)<-[]-(s:Station)-[a:TRAIN]-()
WITH c,
s.stationKey AS stationKey,
COUNT(DISTINCT a.trainNumber) AS number_of_trains
RETURN c.countyId AS countyId,
SUM(number_of_trains) AS numberOfTrainStops"""
county_numberOfStops = conn.query(query, returnDataFrame=True)
# join data frame
regionalAnalysis = pd.merge(county_baseData, county_numberOfStops, how="left", on="countyId")
After querying the data, all variables are evaluated individually. This is done in a map of Germany. Buttons can be used to switch between the variables.
# create plotly map
figMap = px.choropleth_mapbox(regionalAnalysis,
geojson=geodata,
locations='countyId',
color='numberOfTrains',
featureidkey="properties.krs_code",
color_continuous_scale=["#ffffff", "#ec0016"],
title="Number of trains per country",
mapbox_style="white-bg",
zoom=4.6,
center={"lat": 51.1657, "lon": 10.4515},
hover_name='countyName',
hover_data={'countyId': False,
'numberOfTrains': True,
'numberOfTrainStops': True,
'countyGdp': True,
'countyGdpPerCap': True,
'countyUnemp': True,
'countyPopDensity': True,
'countyPop': True,
'countySize': True},
labels={'numberOfTrains': 'Number of trains',
'numberOfTrainStops': 'Number of train stops',
'countyGdp': 'GDP',
'countyGdpPerCap': 'GDP per Capita',
'countyUnemp': 'Unemployment rate',
'countyPopDensity': 'Population density',
'countyPop': 'Total Population',
'countySize': 'County Size'})
# update geos
figMap.update_geos(fitbounds="locations", visible=False)
# update layout
figMap.update_layout(updatemenus=[{"type": "buttons",
"buttons": [{"args": [{"z": [regionalAnalysis["numberOfTrains"]]},
{"title.text": "Number of trains per county"}],
"label": "A Number of trains",
"method": "update"},
{"args": [{"z": [regionalAnalysis["numberOfTrainStops"]]},
{"title.text": "Number of train stops per county"}],
"label": "B Number of train stops",
"method": "update"},
{"args": [{"z": [regionalAnalysis["countyGdpPerCap"]]},
{"title.text": "GDP per Capita per county"}],
"label": "C GDP per capita",
"method": "update"},
{"args": [{"z": [regionalAnalysis["countyUnemp"]]},
{"title.text": "Uenmployment rate per county"}],
"label": "D Unemployment rate",
"method": "update"},
{"args": [{"z": [regionalAnalysis["countyPopDensity"]]},
{"title.text": "Population density per county"}],
"label": "E Population density",
"method": "update"}],
"showactive": True}],
title_x=0.5,
height=600,
autosize=True,
margin={"r": 100, "t": None, "l": 0, "b": 0},
coloraxis_colorbar={"title": None})
# show map
figMap.show()
The variables are discuessed in the following:
After analyzing the variables individually, the relationship between the socioeconomic variables and the long-distance train supply in a county is analyzed using scatter plots. The socioeconomic variable is placed on the x-axis, and the long-distance transportation supply variable is placed on the y-axis. Two scatter plots are created for each socioeconomic variable. The left diagram contains the number of trains on the y-axis, the right diagram the number of train stops. All East German counties are marked in red, all West German counties in blue.
The first pair of scatter plots contains the correlation of long-distance service with the variable population density. The following question is to be answered: Are sparsely populated regions worse connected to long-distance transport?
# create subplot object
fig = make_subplots(rows = 1,
cols=2,
column_titles=["Number of trains", "Number of train stops"])
# create left scatter plot (y = number of trains)
plotNumberOfTrains = px.scatter(regionalAnalysis,
x="countyPopDensity",
y="numberOfTrains",
color="region",
hover_name="countyName",
color_discrete_map={"East Germany": "red", "West Germany": "blue"},
opacity=0.9,
trendline="ols",
labels={"countyPopDensity": "Population density",
"numberOfTrains": "Number of trains",
"region": "Region"})
# add traces
fig.add_trace(plotNumberOfTrains.data[0], row=1, col=1)
fig.add_trace(plotNumberOfTrains.data[1], row=1, col=1)
fig.add_trace(plotNumberOfTrains.data[2], row=1, col=1)
fig.add_trace(plotNumberOfTrains.data[3], row=1, col=1)
# create left scatter plot (x = number of train stops)
plotNumberOfTrainStops = px.scatter(regionalAnalysis,
x="countyPopDensity",
y="numberOfTrainStops",
color="region",
hover_name="countyName",
color_discrete_map={"East Germany": "red", "West Germany": "blue"},
opacity=0.9,
trendline="ols",
labels={"countyPopDensity": "Population density",
"numberOfTrainStops": "Number of train stops",
"region": "Region"})
# add traces
fig.add_trace(plotNumberOfTrainStops.data[0], row=1, col=2)
fig.add_trace(plotNumberOfTrainStops.data[1], row=1, col=2)
fig.add_trace(plotNumberOfTrainStops.data[2], row=1, col=2)
fig.add_trace(plotNumberOfTrainStops.data[3], row=1, col=2)
# add axis titles
fig["layout"]["xaxis"]["title"]="Population density"
fig["layout"]["xaxis2"]["title"]="Population density"
fig["layout"]["yaxis"]["title"]="Number of trains"
fig["layout"]["yaxis2"]["title"]="Number of train stops"
# update layout
fig.update_layout(template="plotly_white")
# only show distinct legend entries
names = set()
fig.for_each_trace(
lambda trace:
trace.update(showlegend=False)
if (trace.name in names) else names.add(trace.name))
# plot figure
fig.show()
The analysis shows that there is a positive correlation between population density and the offer of long-distance trains in a county, i.e. sparsely populated regions are less well connected. However, the correlation is very noisy.
The next pair of scatter plots contains the correlation of long distance train offer with the variable unemployment rate. The following question is to be answered: Are counties with a high unemployment rate less well connected to long-distance rail service?
# create subplot object
fig = make_subplots(rows = 1,
cols=2,
column_titles=["Number of trains", "Number of train stops"])
# create left scatter plot (y = number of trains)
plotNumberOfTrains = px.scatter(regionalAnalysis,
x="countyUnemp",
y="numberOfTrains",
color="region",
hover_name="countyName",
color_discrete_map={"East Germany": "red", "West Germany": "blue"},
opacity=0.9,
trendline="ols",
labels={"countyUnemp": "Unemployment rate",
"numberOfTrains": "Number of trains",
"region": "Region"})
# add traces
fig.add_trace(plotNumberOfTrains.data[0], row=1, col=1)
fig.add_trace(plotNumberOfTrains.data[1], row=1, col=1)
fig.add_trace(plotNumberOfTrains.data[2], row=1, col=1)
fig.add_trace(plotNumberOfTrains.data[3], row=1, col=1)
# create left scatter plot (x = number of train stops)
plotNumberOfTrainStops = px.scatter(regionalAnalysis,
x="countyUnemp",
y="numberOfTrainStops",
color="region",
hover_name="countyName",
color_discrete_map={"East Germany": "red", "West Germany": "blue"},
opacity=0.9,
trendline="ols",
labels={"countyUnemp": "Unemployment rate",
"numberOfTrainStops": "Number of train stops",
"region": "Region"})
# add traces
fig.add_trace(plotNumberOfTrainStops.data[0], row=1, col=2)
fig.add_trace(plotNumberOfTrainStops.data[1], row=1, col=2)
fig.add_trace(plotNumberOfTrainStops.data[2], row=1, col=2)
fig.add_trace(plotNumberOfTrainStops.data[3], row=1, col=2)
# add axis titles
fig["layout"]["xaxis"]["title"]="Unemployment rate"
fig["layout"]["xaxis2"]["title"]="Unemployment rate"
fig["layout"]["yaxis"]["title"]="Number of trains"
fig["layout"]["yaxis2"]["title"]="Number of train stops"
# update layout
fig.update_layout(template="plotly_white")
# only show distinct legend entries
names = set()
fig.for_each_trace(
lambda trace:
trace.update(showlegend=False)
if (trace.name in names) else names.add(trace.name))
# plot figure
fig.show()
The analysis shows that there is a slight correlation between the unemployment rate and the supply of long-distance trains. The question is whether this correlation is statistically significant or not.
The last pair of scatter plots contains the correlation of long distance train offer with the variable GDP per capita. The following question is to be answered: Are counties with higher economic activity better connected to long-distance rail service?
# create subplot object
fig = make_subplots(rows = 1,
cols=2,
column_titles=["Number of trains", "Number of train stops"])
# create left scatter plot (y = number of trains)
plotNumberOfTrains = px.scatter(regionalAnalysis,
x="countyGdpPerCap",
y="numberOfTrains",
color="region",
hover_name="countyName",
color_discrete_map={"East Germany": "red", "West Germany": "blue"},
opacity=0.9,
trendline="ols",
labels={"countyGdpPerCap": "GDP per capita",
"numberOfTrains": "Number of trains",
"region": "Region"})
# add traces
fig.add_trace(plotNumberOfTrains.data[0], row=1, col=1)
fig.add_trace(plotNumberOfTrains.data[1], row=1, col=1)
fig.add_trace(plotNumberOfTrains.data[2], row=1, col=1)
fig.add_trace(plotNumberOfTrains.data[3], row=1, col=1)
# create left scatter plot (x = number of train stops)
plotNumberOfTrainStops = px.scatter(regionalAnalysis,
x="countyGdpPerCap",
y="numberOfTrainStops",
color="region",
hover_name="countyName",
color_discrete_map={"East Germany": "red", "West Germany": "blue"},
opacity=0.9,
trendline="ols",
labels={"countyGdpPerCap": "GDP per capita",
"numberOfTrainStops": "Number of train stops",
"region": "Region"})
# add traces
fig.add_trace(plotNumberOfTrainStops.data[0], row=1, col=2)
fig.add_trace(plotNumberOfTrainStops.data[1], row=1, col=2)
fig.add_trace(plotNumberOfTrainStops.data[2], row=1, col=2)
fig.add_trace(plotNumberOfTrainStops.data[3], row=1, col=2)
# add axis titles
fig["layout"]["xaxis"]["title"]="GDP per capita"
fig["layout"]["xaxis2"]["title"]="GDP per capita"
fig["layout"]["yaxis"]["title"]="Number of trains"
fig["layout"]["yaxis2"]["title"]="Number of train stops"
# update layouts
fig.update_layout(template="plotly_white")
# only show distinct legend entries
names = set()
fig.for_each_trace(
lambda trace:
trace.update(showlegend=False)
if (trace.name in names) else names.add(trace.name))
# plot figure
fig.show()
The analysis shows that there is a positive correlation between GDP per capita and the supply of long-distance trains, i.e. an economically strong country has a better supply of long-distance trains. However, the correlation is quite noisy.
The final analysis in this section is intended to answer the question of whether or not eastern Germany (former DDR and Berlin) offers better long-distance access than western Germany. This is done on the basis of the number of train stops. Since the two regions are of different sizes, the variable number of train stops is scaled by the number of inhabitants and multiplied by 100,000 (for readability reasons).
# aggregate county population number of train stops
tmp = regionalAnalysis[["region", "numberOfTrainStops", "countyPop"]].groupby("region", as_index=False).sum()
# calculate number of trains stops per 100 thousand inhabitants
tmp["trainStopsPer100k"] = (tmp["numberOfTrainStops"] / tmp["countyPop"])*100000
# update traces
# create the chart
fig = px.bar(tmp,
x="region",
y="trainStopsPer100k",
title = "Difference between East and West Germany",
labels={"trainStopsPer100k": "Number of train stops per 100k inhabitants",
"region": "Region"})
# update traces
fig.update_traces(marker_color="#ec0016")
# show figure
fig.show()
The analysis shows that there is no significant difference between western and eastern Germany in terms of long-distance train service.
This section analyzes the stations where long-distance trains stop.
First, a map of Germany is displayed with all stations. The size and color of the markers depends on the number of long-distance trains.
# query data
query = """MATCH (s:Station)-[a:TRAIN]-()
RETURN s.stationKey as stationKey,
s.name as station,
s.latitude as latitude,
s.longitude as longitude,
COUNT(DISTINCT a.trainNumber) AS numberOfTrains"""
stationAnalysis = conn.query(query, returnDataFrame=True)
# create figure object
figStationLoc = px.scatter_mapbox(stationAnalysis,
lat="latitude",
lon="longitude",
color="numberOfTrains",
size="numberOfTrains",
size_max=40,
hover_name="station",
hover_data=["numberOfTrains"],
zoom=5,
center=dict(lat=51.08342, lon=10.42345),
color_continuous_scale=['#646973', '#ec0016'])
# update geos
figStationLoc.update_geos(scope="europe")
# update layout
figStationLoc.update_layout(mapbox_style="carto-positron",
height=600,
autosize=True,
margin={"r": 0, "t": 0, "l": 0, "b": 0},
paper_bgcolor="#303030",
plot_bgcolor="#303030")
# show map
figStationLoc.show()
The map shows that Frankfurt Hbf, Berlin Hbf and Hannover Hbf are the stations with the most long-distance trains.
In this section, the most central station is determined using various centrality measures. The measures are calculated on the database server using procedures of the neo4j plugin Graph Data Science Library.
Before one can use the centrality measure algorithms, a graph protection must be created in the graph catalog. A graph projection can be described as a materialized view of the stored graph that contains only a subset of the information relevant to the analysis. The data of such a projection is compressed and stored in the memory (see neo4j documentation).
A graph protection is created with the procedure gds.graph.project(). Our protection has the name bahn and contains all nodes with the name Station and the edges with the name Train. Two stations can be connected by several trains (parallel relations). For the calculation of centrality measures, we want to have only one relationship between two stations. Therefore, we aggregate all relationships between two stations into a single relationships in the projection. The new relation has the property numberOfTrains, which counts how many trains connect the two stations.
# drop graph projection if exists
query = """CALL gds.graph.drop('bahn', false) YIELD graphName;"""
conn.query(query, returnDataFrame=True)
| graphName | |
|---|---|
| 0 | bahn |
# create new graph projection
query = """CALL gds.graph.project('bahn',
'Station',
{
TRAIN: {
orientation: "UNDIRECTED",
properties: {
numberOfTrains: {
property: '*',
aggregation: 'COUNT'
}
}
}
})
YIELD
graphName AS graph,
nodeProjection,
nodeCount AS nodes,
relationshipCount AS rels;"""
conn.query(query, returnDataFrame=True)
| graph | nodeProjection | nodes | rels | |
|---|---|---|---|---|
| 0 | bahn | {'Station': {'label': 'Station', 'properties':... | 347 | 1226 |
Degree Centrality
First, the degree centrality is calculated. This indicator measures the number of connections a station has to other stations. With the exception of the start or end station, each train that passes through a station increases the station's degree centrality by two (connection to the previous and subsequent station). The degree centrality is caculated with the procedure gds.degree.stream().
query = """CALL gds.degree.stream('bahn',
{relationshipWeightProperty:"numberOfTrains"})
YIELD nodeId, score
RETURN gds.util.asNode(nodeId).name AS station, score
ORDER BY score DESC"""
degCe = conn.query(query, returnDataFrame=True).head(10)
degCe = degCe["station"]
Eigenvector Centrality
The eigenvector centrality takes into account how well a station is connected to other important stations. It measures the "transitive influence of the nodes" (see neo4j documentation). The eigenvector centrality is calculated with the procedure gds.eigenvector.stream().
query = """CALL gds.eigenvector.stream('bahn', {relationshipWeightProperty:"numberOfTrains"})
YIELD nodeId, score
RETURN gds.util.asNode(nodeId).name AS station, score
ORDER BY score DESC"""
eiCe = conn.query(query, returnDataFrame=True).head(10)
eiCe = eiCe["station"]
Closeness Centrality
The proximity centrality of a station is determined by the proximity of a station to all other stations. For each station, the shortest distance to all other stations is calculated and summed up. The closeness centrality score is determined from the inverse of this sum (see neo4j documentation). The closeness centrality is calculated with the procedure gds.beta.closeness.stream().
query = """CALL gds.beta.closeness.stream('bahn')
YIELD nodeId, score
RETURN gds.util.asNode(nodeId).name AS station, score
ORDER BY score DESC"""
clCe = conn.query(query, returnDataFrame=True).head(10)
clCe = clCe["station"]
Betweeness Centrality
The betweeness centrality indicates how often a station occurs on the shortest paths between two stations. To calculate this measure, the shortest paths between all pairs of nodes must be created (see neo4j documentation). The betweenness centrality is calculated with the procedure gds.betweenness.stream().
Note: The shortest paths are calculated differently than in the Vue JS app because the betweeness centrality calculation procedure does not support weighted shortest paths.
query = """CALL gds.betweenness.stream('bahn')
YIELD nodeId, score
RETURN gds.util.asNode(nodeId).name AS station, score
ORDER BY score DESC"""
betCe = conn.query(query, returnDataFrame=True).head(10)
betCe = betCe["station"]
Results
The following table shows the results of the different centrality measures.
summary = pd.DataFrame(degCe)
summary = summary.rename(columns={"station":"degree centrality"})
summary["eigenvector centrality"] = eiCe
summary["closeness centrality"] = clCe
summary["betweeness centrality"] = betCe
summary
| degree centrality | eigenvector centrality | closeness centrality | betweeness centrality | |
|---|---|---|---|---|
| 0 | Berlin Hbf | Berlin Hbf | Frankfurt(Main)Hbf | Frankfurt(Main)Hbf |
| 1 | Hannover Hbf | Hannover Hbf | Hannover Hbf | Hannover Hbf |
| 2 | Hamburg Hbf | Hamburg Hbf | Mannheim Hbf | Mannheim Hbf |
| 3 | Frankfurt(Main)Hbf | Frankfurt(Main)Hbf | Frankfurt(M) Flughafen Fernbf | Hamburg Hbf |
| 4 | Frankfurt(M) Flughafen Fernbf | Mannheim Hbf | Hamburg Hbf | Stuttgart Hbf |
| 5 | Berlin-Spandau | Frankfurt(M) Flughafen Fernbf | Berlin-Spandau | Erfurt Hbf |
| 6 | Mannheim Hbf | Berlin-Spandau | Köln Hbf | Nürnberg Hbf |
| 7 | Düsseldorf Hbf | Nürnberg Hbf | Erfurt Hbf | Berlin Hbf |
| 8 | Köln Hbf | Stuttgart Hbf | Göttingen | Berlin-Spandau |
| 9 | Duisburg Hbf | Hamburg Dammtor | Darmstadt Hbf | Köln Hbf |
According to the degree and eigenvector centrality, Berlin Hbf is the most central station. According to closeness and betweenness centrality, Frankfurt Hbf is the most central station. This is not surprising, since Frankfurt is located in the center of Germany and therefore many destinations can be reached quickly from there and many connections between two stations run via Frankfurt.
Note: Frankfurt Hbf has more trains than Berlin Hbf (see chapter 1.3.1), while Berlin Hbf has more connections to other stations (see degree of centrality). This is due to the fact that many trains start and end at Frankfurt Hbf (connection to one station only), while Berlin Hbf is often a stopover of the train journey (connection to two stations). This is shown by the following query:
query = """MATCH (n:Station)-[t:TRAIN]-()
WHERE n.name = "Berlin Hbf" OR n.name = "Frankfurt(Main)Hbf"
WITH n.name AS station, t.trainNumber as trainNumber, count(t.trainNumber) AS connections
WHERE connections = 1
RETURN station, COUNT(*) AS startingAndEndingTrains"""
conn.query(query, returnDataFrame=True)
| station | startingAndEndingTrains | |
|---|---|---|
| 0 | Berlin Hbf | 15 |
| 1 | Frankfurt(Main)Hbf | 134 |
The last section analyzes the characteristics of the trains.
The first question to answer is: What are the most common types of trains?
This question is answered with the following bar chart.
# query data
query = """MATCH ()-[train:TRAIN]->()
WITH DISTINCT train.trainType as trainType, train.trainNumber as trainNumber
RETURN DISTINCT trainType, COUNT(trainNumber) as count
ORDER BY count ASC"""
data = conn.query(query, returnDataFrame=True)
# plot results
fig = px.bar(data,
y="trainType",
x="count",
template="simple_white",
title="Trains per train type",
orientation="h",
labels={"trainType": "Train type",
"count": "Number of trains"})
# update traces
fig.update_traces(marker_color="#ec0016")
# show plot
fig.show()
The most common type of long-distance train is the ICE (InterCity Express). In second place is the IC (InterCity).
Other, not so common train types are:
Note: This project only deals with long-distance trains. The train type RE (regional express) is not a long-distance train. The RE trains in this table are IC trains that are operated as regional express trains from a specific station.
Then, the longest train is determined according to the pure travel time (the time spent in the stations is not taken into account).
query = """MATCH (s:Station)-[train:TRAIN]->(z:Station)
WITH s, train, z,
point({longitude: s.longitude, latitude: s.latitude}) as startPoint,
point({longitude: z.longitude, latitude: z.latitude}) as endPoint
ORDER BY train.trainNumber, train.lag
RETURN train.trainNumber AS trainNumber,
COLLECT(s.name)[0] AS startStation,
COLLECT(z.name)[size(COLLECT(z.name))-1] AS endStation,
ROUND(SUM(train.duration)/60.0 ,2)AS hours,
SUM(round(point.distance(startPoint, endPoint)/1000)) AS travelDistance
ORDER By hours DESC"""
conn.query(query, returnDataFrame=True).head(10)
| trainNumber | startStation | endStation | hours | travelDistance | |
|---|---|---|---|---|---|
| 0 | IC 2216 | Offenburg | Greifswald | 11.33 | 1105.0 |
| 1 | ICE 699 | Hamburg-Altona | München Hbf | 10.62 | 1085.0 |
| 2 | IC 2213 | Ostseebad Binz | Stuttgart Hbf | 10.55 | 965.0 |
| 3 | ICE 619 | Kiel Hbf | München Hbf | 10.45 | 1005.0 |
| 4 | ICE 927 | Hamburg-Altona | Passau Hbf | 10.40 | 950.0 |
| 5 | ICE 639 | Ostseebad Binz | München Hbf | 10.23 | 988.0 |
| 6 | ICE 618 | München Hbf | Kiel Hbf | 10.20 | 1015.0 |
| 7 | ICE 1020 | Passau Hbf | Hamburg-Altona | 10.18 | 951.0 |
| 8 | ICE 633 | Ostseebad Binz | München Hbf | 10.15 | 988.0 |
| 9 | ICE 632 | München Hbf | Ostseebad Binz | 10.13 | 988.0 |
The longest train by travel time is IC 2216, which travels from Offenburg to Greifswald in around 11.33 hours. The train covers a distance of around 1105 km.
The three longest trains are shown on the following map of Germany.
# top 3 trains
trains = ["IC 2216", "ICE 699", "IC 2213"]
# query for stations & connections
queryStation = 'MATCH (s:Station)-[train:TRAIN]-() WHERE train.trainNumber IN {ids} RETURN DISTINCT s.name AS stationName, s.latitude AS latitude, s.longitude AS longitude'
queryTrain = '''MATCH (s:Station)-[train:TRAIN {{trainNumber: "{id}"}} ]->(z:Station)
RETURN train.trainNumber AS trainNumber, s.latitude AS startLat, s.longitude AS startLong, z.latitude AS endLat, z.longitude AS endLong, train.lag AS lag ORDER BY train.lag'''
# get all stations
stations = conn.query(queryStation.format(
ids=trains), returnDataFrame=True)
# create figure (mapbox)
fig = px.scatter_mapbox(stations,
lat="latitude",
lon="longitude",
hover_name="stationName",
zoom=5,
center=dict(lat=51.08342, lon=10.42345))
# update geos (europe)
fig.update_geos(scope="europe")
# add traces for each train
for train in trains:
# some variables
latList = []
lonList = []
# get route for train
trainRoute = conn.query(queryTrain.format(id=train), returnDataFrame=True).to_dict('records')
# convert trainRoute to (lonList + latList)
for index, t in enumerate(trainRoute):
# only first item
if index == 0:
latList.append(t['startLat'])
lonList.append(t['startLong'])
# every item
latList.append(t['endLat'])
lonList.append(t['endLong'])
# add trace
fig.add_trace(go.Scattermapbox(mode="lines", lon=lonList, lat=latList, name=train))
# update layout
fig.update_layout(mapbox_style="carto-positron",
height=600,
autosize=True,
margin={"r": 0, "t": 0, "l": 0, "b": 0},
paper_bgcolor="#303030",
plot_bgcolor="#303030")
# show map
fig.show()
In this subsection, the longest and shortest stop distances are determined based on travel time.
We start with the longest stop distance between two stations.
query = """MATCH (s)-[train:TRAIN]->(z)
WITH s, train, z,
point({longitude: s.longitude, latitude: s.latitude}) as startPoint,
point({longitude: z.longitude, latitude: z.latitude}) as endPoint
RETURN s.name as startStation,
train.trainNumber AS trainNumber,
ROUND(SUM(train.duration)/60.0 ,2)AS hours,
train.duration AS minutes,
z.name as endStartion,
round(point.distance(startPoint, endPoint)/1000) AS travelDistance
ORDER BY minutes DESC
LIMIT 10"""
conn.query(query, returnDataFrame=True)
| startStation | trainNumber | hours | minutes | endStartion | travelDistance | |
|---|---|---|---|---|---|---|
| 0 | Berlin-Spandau | ICE 1154 | 3.83 | 230 | Köln Hbf | 465.0 |
| 1 | Berlin-Spandau | ICE 1156 | 3.83 | 230 | Köln Hbf | 465.0 |
| 2 | Berlin-Spandau | ICE 1158 | 3.82 | 229 | Köln Hbf | 465.0 |
| 3 | Köln Hbf | ICE 1159 | 3.75 | 225 | Berlin-Spandau | 465.0 |
| 4 | Köln Hbf | ICE 1155 | 3.75 | 225 | Berlin-Spandau | 465.0 |
| 5 | Köln Hbf | ICE 1157 | 3.75 | 225 | Berlin-Spandau | 465.0 |
| 6 | München Ost | NJ 468 | 3.43 | 206 | Karlsruhe Hbf | 255.0 |
| 7 | Würzburg Hbf | NJ 490 | 3.02 | 181 | Göttingen | 193.0 |
| 8 | Berlin Hbf | D 300 | 3.02 | 181 | Hamburg Hbf | 252.0 |
| 9 | Essen Hbf | ICE 1038 | 2.77 | 166 | Hamburg Hbf | 310.0 |
The longest travel time between two stations can be achieved with an ICE between Cologne and Berlin (almost 4 hours). The train covers a distance of about 465 km without stopping. The trains are a so called ICE Sprinter (see tag24).
Next, the shortest travel time between two stations is determined.
query = """MATCH (s)-[train:TRAIN]->(z)
WITH s, train, z,
point({longitude: s.longitude, latitude: s.latitude}) as startPoint,
point({longitude: z.longitude, latitude: z.latitude}) as endPoint
RETURN s.name as startStation,
train.trainNumber AS trainNumber,
train.duration AS minutes,
z.name as endStartion,
round(point.distance(startPoint, endPoint)/1000,2) AS travelDistance
ORDER BY minutes ASC
LIMIT 10"""
conn.query(query, returnDataFrame=True)
| startStation | trainNumber | minutes | endStartion | travelDistance | |
|---|---|---|---|---|---|
| 0 | Norddeich Mole | IC 2209 | 1 | Norddeich | 0.33 |
| 1 | Norddeich Mole | IC 2205 | 1 | Norddeich | 0.33 |
| 2 | Norddeich Mole | IC 2321 | 1 | Norddeich | 0.33 |
| 3 | Norddeich Mole | IC 2039 | 1 | Norddeich | 0.33 |
| 4 | Norddeich Mole | IC 2435 | 1 | Norddeich | 0.33 |
| 5 | Norddeich Mole | IC 2203 | 1 | Norddeich | 0.33 |
| 6 | Norddeich Mole | IC 2037 | 1 | Norddeich | 0.33 |
| 7 | Eutingen im Gäu | IC 2286 | 2 | Ergenzingen | 2.30 |
| 8 | Hamburg Hbf | ICE 580 | 3 | Hamburg Dammtor | 1.45 |
| 9 | Hamburg Hbf | IC 2310 | 3 | Hamburg Dammtor | 1.45 |
The shortest distance of a train is between Norddeich Mole and Norddeich. The distance is 0.3 km. All trains need 1 minute for this distance.
This subsection deals with the catering on trains. On the one hand, it will be analyzed in which trains catering (on-board restaurant or on-board bistro) is offered. On the other hand, it will be analyzed when catering is offered. In some trains the catering does not open at the starting station of the train or the catering closes before the final station. These issues are analyzed using a heat map. The hour of departure is plotted on the x-axis and the type of train is plotted on the y-axis. The color depends on the percentage of trains with a catering service.
# query data
query = """MATCH (s:Station)-[train:TRAIN]->(z:Station)
WITH toInteger(left(train.depTime,2)) AS departureHour,
train.trainType AS trainType,
COUNT(train.trainNumber) AS trainLagsAll,
COUNT(CASE WHEN train.bordrestaurant = true THEN 1 END) AS trainLagsRestaurant,
COUNT(CASE WHEN train.bordbistro = true THEN 1 END) AS trainLagsBistro,
COUNT(CASE WHEN train.bordrestaurant = true OR train.bordbistro = true THEN 1 END) AS trainLagsCatering
RETURN
trainType,
departureHour,
round(toFloat(trainLagsCatering) /toFloat(trainLagsAll)*100,2) AS Catering
ORDER BY departureHour"""
# execute query
cateringTrainHour = conn.query(query, returnDataFrame=True)
# only train types with more than 10 trains per day (see chapter 1.4.1)
cateringTrainHour = cateringTrainHour[cateringTrainHour["trainType"].isin(
["ICE", "EC", "IC", "TGV", "NJ", "RJX", "ECE"])]
# pivot data
cateringTrainHour = cateringTrainHour.pivot(index="trainType",
columns="departureHour",
values="Catering")
# create image
fig = px.imshow(cateringTrainHour,
title="Catering offer in long-distance trains",
labels={"color": "Trains with catering in %"},
color_continuous_scale=['#646973', '#ec0016'])
# update axis names
fig.update_layout(yaxis_title="Train type", xaxis_title="Departure Hour")
# show plot
fig.show()
The heatmap shows only train types with more than 10 trains per day.
The analysis shows that the RJX, ECE and TGV offer catering at all times. Almost all EC also offer catering services at all times. NJ overnight trains do not offer catering services at any time.
The catering on offer on IC and ICE trains increases in the morning and decreases in the evening. During the day, almost all ICE trains offer catering, while only around 30 percent of IC trains do.
The last subsection will analyze which types of trains can carry bicycles.
query = """MATCH (s:Station)-[train:TRAIN]->(z:Station)
WITH train.trainType AS trainType,
COUNT(train.trainNumber) AS trainLagsAll,
COUNT(CASE WHEN train.bicycle = true THEN 1 END) AS trainLagsBicycle
RETURN
trainType,
round(toFloat(trainLagsBicycle) /toFloat(trainLagsAll)*100,2) AS Bicycle
ORDER BY Bicycle ASC"""
data = conn.query(query, returnDataFrame=True)
# plot results
fig = px.bar(data,
y="trainType",
x="Bicycle",
template="simple_white",
title="Trains with bicycle in percent (%)",
orientation="h",
labels={"trainType":"Train type",
"Bicycle":"Trains with bicycle in %"})
# update traces
fig.update_traces(marker_color="#ec0016")
# show plot
fig.show()
The bar chart shows that bicycle transport varies depending on the type of train. Bicycles can be carried in all RJX and in almost all IC. In contrast, only half of the ICE can carry bicycles.